Melbourne Property Market Analysis

Synopsis

Data on Melbourne’s property market were obtained online from domain.com.au. The goal of this analysis is to better understand the property market, build a predictive model, and compare the accuracy of several machine learning methods. OpenStreetMap was used to add supplementary features to the dataset before an array of machine learning models were fitted and combined via stacking. Gradient boosted decision trees were found to perform best on this task, although a final conclusion is yet to be published.

Introduction

This analysis concerns Melbourne’s property market, a topic of much debate in today’s social and political climate. The focus of this analysis is to better understand the contributing factors to property value. Change in value over time is not addressed here.

The primary objectives of this analysis are threefold:

  • Prediction: It would be useful to be able to predict the sale price of a property.

  • Description: Understanding what makes a valuable property in Melbourne is useful knowledge for any home owner, and it makes for an interesting analysis as well.

  • Model Assessment: We’re going to fit a variety of models. Their accuracy will be compared, along with any strengths and weaknesses specific to this dataset.

The R code for the analysis can be viewed by clicking the ‘code’ buttons along the way; alternatively, you can show all code chunks by selecting the option in the dropdown menu at the top right of the page.

Primary Models

Three models are fitted to the data, each with its own strengths, chosen so that the strengths of each offset the weaknesses of the others. K-nearest neighbours with kernel smoothing is used to capture local similarity, not only geospatially but throughout the entire feature space. Gradient boosted decision trees are used for their robustness and successful track record. Finally, a generalised additive model is fitted, which fits a spline to each feature’s relationship with sale price and then fits a linear combination of these nonlinear functions.

Five-fold cross-validation is used. Ten folds were used at first, but this was decreased to five to reduce the variance of the validation score. This likely increases the bias of the validation score (overestimating error), but accuracy estimation is not the goal of this part of the analysis.
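The folds object used throughout the analysis is a list of validation-row indices, one element per fold. Its construction isn’t shown in this section; a minimal sketch of how such a list might be built (the row count and seed below are purely illustrative) is:

```r
#Hypothetical reconstruction of the `folds` object: a list of KFOLD
#elements, each holding the validation-row indices for one fold.
KFOLD <- 5
set.seed(0)
n <- 100                                  #stands in for nrow(train0)
shuffled <- sample(n)                     #random permutation of row indices
groups <- cut(seq_len(n), breaks=KFOLD, labels=FALSE)
folds <- split(shuffled, groups)          #list of 5 disjoint index vectors
```

Each fold is then held out for validation while the remaining rows train the model, exactly as in the loops over 1:KFOLD later on.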

Category Encodings

Before fitting, it is useful for the tree and neighbourhood models to recast the factor variables type and method as numeric. They are encoded in order of their mean price. When target encoding, it is often wise to regularise the ordering using k-fold or expanding-mean encodings, but in this case, with only three levels and an obvious separation in price, there is little danger of leaking the target log(price) into the training data.
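To make the ordering concrete, here is a small illustrative sketch of how a mean-price ordering can be derived. The data frame below is invented for illustration; the real ranking comes from train0:

```r
#Illustrative only: derive a level ordering from mean log(price).
#These prices are made up; the real ordering comes from train0.
toy <- data.frame(
    type  = c('Unit','Unit','Townhouse','Townhouse','House','House'),
    price = c(400000, 450000, 650000, 700000, 900000, 1100000)
)
level_means <- tapply(log(toy$price), toy$type, mean)
ordering <- names(sort(level_means))      #ascending mean log(price)
#ordering: 'Unit', 'Townhouse', 'House' -> encoded as 1, 2, 3
```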

ggplot(property_data,aes(x=type,y=log(price))) + geom_boxplot(fill='#0078D7')

The method feature tells a similar story, details are omitted.

encode_type <- function(df) {
    #convert to character before as.numeric: as.numeric on a factor would
    #return the underlying level codes rather than the intended labels
    df$type_encoded <- revalue(df$type,replace=c('Unit'='1','Townhouse'='2','House'='3')) %>%
        as.character %>% as.numeric
    return (df)
}
encode_method <- function(df) {
    df$method_encoded <- revalue(df$method,replace=c('SP'='1','SA'='2','S'='3')) %>%
        as.character %>% as.numeric
    return (df)
}

Gradient Boosted Trees

The model fitting begins with gradient boosted decision trees (GBDT), implemented with the xgboost package, an R API for the popular software library of the same name. GBDT is one of the most widely applicable and robust machine learning techniques, able to capture interactions and non-linear trends without explicit encoding. For more information, see the xgboost documentation.

Polar Coordinates

Looking above at the map aggregated by suburb, it appears that property value may be better parameterised with polar coordinates. Let’s set this up now with a function:

#----compute distance from city and bearings------------------------------------
polarise <- function(lnglat_centre,df) {
    locs <- select(df,c(lng,lat))
    #great-circle distance (metres) from the centre, via geosphere::distm
    df$dist_cbd <- apply(locs,1,function(loc){distm(loc,lnglat_centre)})
    #compass bearing from the centre, via geosphere::bearing
    df$bearing <- apply(locs,1,function(loc){bearing(lnglat_centre,loc)})
    return(df)
}

A variety of features were experimented with; ultimately only a small subset was chosen for the model.

Tuning the number of trees was performed via early stopping: once the validation score stops improving and starts to increase, training is stopped. Other hyperparameters were tuned manually, such as the maximum tree depth max_depth, the observation subsampling ratio subsample, and the feature subsampling ratio colsample_bytree. After these were optimised, the learning rate was decreased to obtain a better fit.

The graphic below shows the score as a function of iterations during training, with the validation score shown in orange, training in blue. The dashed lines show one standard deviation above/below the mean score over the folds.

#---------------------prepare data for xgboost api---------------------------
xgb_train0 <- encode_type(train0) %>% encode_method

#add polar coordinates
MELBOURNE_CENTRE <- c(144.9631,-37.8136)
xgb_train0 <- polarise(MELBOURNE_CENTRE,xgb_train0)

xgb_features <- c(
               'building_area'
              ,'dist_cbd'
              ,'bearing'
              ,'year_built'
              ,'type_encoded'
              ,'nrooms'
              ,'land_area'
              ,'method_encoded'
              ,'nbathroom'
              ,'ncar'
              ,'ntrain_3000'
)

xgb_train0_y <- log(xgb_train0$price)
xgb_train0_x <- xgb_train0 %>% select(xgb_features)

#prepare the data in an xgboost specific data structure
Dtrain0 <- xgb.DMatrix(data=as.matrix(xgb_train0_x),label=xgb_train0_y)

#------------------------ setup parameters--------------------------------------
PARAMS <- list(
    seed=0,
    objective='reg:linear',
    eta=0.005,
    eval_metric='rmse',
    max_depth=7,
    colsample_bytree=0.78,
    subsample=0.80
    
)

#------------------------cross validation---------------------------------------

set.seed(45785)
cv_obj <- xgb.cv(
    params=PARAMS,
    data=Dtrain0,
    nthread=4,
    folds=folds,
    verbose=FALSE,
    nrounds=1e+6,
    early_stopping_rounds=2000
    )

#save the number of trees that gives the best out of fold generalisation
OPTIMAL_NROUNDS <- cv_obj$best_ntreelimit

#-----------------------------plot----------------------------------------------

#plot training information
eval_log <- cv_obj$evaluation_log
ggplot(eval_log,aes(x=iter)) + 
  coord_cartesian(ylim=c(0,0.4)) + xlab('iteration') + ylab('score') + 
  ggtitle('Validating Xgboost model') +
  geom_line(aes(y=train_rmse_mean),lwd=1,col='#0078D7') +
  geom_line(aes(y=train_rmse_mean + train_rmse_std),lwd=1,col='#0078D7',lty=2) +
  geom_line(aes(y=train_rmse_mean - train_rmse_std),lwd=1,col='#0078D7',lty=2) +
  geom_line(aes(y=test_rmse_mean),lwd=1,col='orange') +
  geom_line(aes(y=test_rmse_mean + test_rmse_std),lwd=1,col='orange',lty=2) +
  geom_line(aes(y=test_rmse_mean - test_rmse_std),lwd=1,col='orange',lty=2) 

The figure below shows the gain of each feature in the model after it is retrained on all of train0.

#----------------generate out-of-fold (oof) predictions-------------------------

#initialise the out-of-fold predictions vector
xgb_oof_preds <- rep(NA,nrow(train0))

for (i in 1:KFOLD) {
    fold <- folds[[i]]
    
    #prepare fold specific training data
    train0_x_fold <- xgb_train0_x[-fold,]
    train0_y_fold <- xgb_train0_y[-fold]
    Dtrain0_fold <- xgb.DMatrix(data=as.matrix(train0_x_fold),label=train0_y_fold)
    
    #prepare fold specific validation data
    val0_x_fold <- xgb_train0_x[fold,]
    val0_y_fold <- xgb_train0_y[fold]
    Dval0_fold <- xgb.DMatrix(data=as.matrix(val0_x_fold),label=val0_y_fold)
    
    #fit the model
    model <- xgb.train(params=PARAMS,
                       data=Dtrain0_fold,
                       nrounds=OPTIMAL_NROUNDS,
                       verbose=0)
    
    #save out of fold predictions
    preds <- predict(model,newdata=Dval0_fold)
    xgb_oof_preds[fold] <- preds
}

#compute out of fold error
xgb_oof_error <- sqrt(mean((xgb_oof_preds-xgb_train0_y )^2))

#---------------------- train model on all train0-------------------------------

#Retrain the model on all train0, this will be used to predict on the ensemble
#validation set, the predictions will be used to 
#validate the ensembling meta model
xgb_model <- xgboost(data = as.matrix(xgb_train0_x), 
                     label = xgb_train0_y , 
                     params = PARAMS,
                     verbose=FALSE,
                     nrounds = OPTIMAL_NROUNDS
)

#----------------------------feature importance---------------------------------

#Plot feature importances for this model
xgb_fi <- xgb.importance(model=xgb_model)
xgb.plot.importance(xgb_fi,measure='Gain',main='Feature Importance (gain)')

Now the xgboost model is used to predict on ensemble_validation for later ensembling.

#----------------create predictions for the ensemble fold-----------------------

#These predictions will be used to validate the ensembling meta model
xgb_ensemble_validation <- encode_type(ensemble_validation) %>% encode_method %>% 
    polarise(MELBOURNE_CENTRE,.)

xgb_ensemble_validation_y <- log(xgb_ensemble_validation$price)
xgb_ensemble_validation_x <- xgb_ensemble_validation %>% 
    select(xgb_features) %>% as.matrix

xgb_ens_preds <- predict(xgb_model,newdata=xgb_ensemble_validation_x)

Below are individual conditional expectation (ICE) plots for the most relevant features. These are the same as partial dependence plots, only disaggregated: each thin line shows the estimated effect of the X variable on predicted log(price) for an individual observation. Visualising the data this way gives a better sense of the reliability of the plot and makes interactions easier to spot; if the thin lines behave differently for different values of the X variable, then X might be interacting with another variable. You can learn more about ICE plots by reading this paper. The authors of this paper also wrote the R package ICEbox, used to generate these plots. The highlighted line is the mean of the individual conditional expectation curves, which is identical to a partial dependence plot.

Pay careful attention to the tick marks along the bottom, detailing the distribution of data points along the X-axis. Note also that, to reduce computational demand, only a randomly selected 20% of points are used for the figures below.

plot_xgb_ice <- function(model,X,y,feature) {
    colour <- rep(rgb(0,0,0,0.05),nrow(X))
    ice_obj = ice(model,
                  X=as.matrix(X),
                  y=y,
                  predictor=which(feature == names(X)),
                  verbose=F,
                  frac_to_build=0.2)
    plot(ice_obj,plot_orig_pts=F,colorvec=colour,xlab=feature)
}
plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'building_area')

plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'dist_cbd')

plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'bearing')

plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'year_built')

K-Nearest Neighbours

Ensembling tends to perform better when combining models of differing structure. We’ve already fitted a tree-based model; now a nearest neighbours model is fitted.

This model makes use of kernel smoothing: the relevance of each neighbouring data point is scaled by a kernel function, a monotonically decreasing function of distance. The kknn package is used, optimised with caret. The Epanechnikov kernel was chosen ahead of time. The kknn package adjusts the kernel to appropriately weight each dimension of the data, so that the model is invariant under shifts and dilations of the feature space; there is no need to scale or centre the features.
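For intuition, the Epanechnikov kernel itself is simple: a quadratic that decays to zero at a scaled distance of one. A sketch only; kknn performs the distance scaling internally:

```r
#Epanechnikov kernel: weight decays quadratically, reaching zero at
#scaled distance 1 (illustrative; kknn handles the scaling itself).
epanechnikov <- function(d) ifelse(abs(d) <= 1, 0.75 * (1 - d^2), 0)

epanechnikov(0)      #closest possible neighbour: weight 0.75
epanechnikov(0.5)    #weight 0.5625
epanechnikov(1.2)    #outside the bandwidth: weight 0
```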

#---------------------------prepare data----------------------------------------

knn_train0 <- encode_type(train0)
knn_features <- c('building_area'
              ,'lng'
              ,'lat'
              ,'year_built'
              ,'type_encoded'
              ,'nrooms'
              ,'land_area'
)

knn_train0_y <- log(knn_train0$price)
knn_train0_x <- knn_train0 %>% select(knn_features)

#---------------Prepare for caret's optimisation--------------------------------

#caret's trainControl function demands training indices;
#simply passing -folds won't work
train_folds <- lapply(folds,function(f){which(!(1:nrow(knn_train0) %in% f))})

#using folds created earlier, create an object to pass to caret 
#for cross validation
trainingParams <- trainControl(index=train_folds,
                              indexOut=folds,
                              verbose=FALSE)

#create a dataframe of hyperparameter values to search
tunegrid <- data.frame(kmax=1:30,
                      distance=1,
                      kernel='epanechnikov'
)

#-----------------------------Run Caret-----------------------------------------

#setup parallel computing cluster
cl <- makePSOCKcluster(4)
registerDoParallel(cl)


#Find the optimal kmax hyperparameter
knn_model <- train(x=knn_train0_x,
                  y=knn_train0_y,
                  method='kknn',
                  metric='RMSE',
                  tuneGrid = tunegrid,
                  trControl = trainingParams
                  
)

The figure below shows the validation score as a function of the number of neighbouring points included. Based on this, a neighbourhood of 10 is chosen.

#-------------------------------------------plot--------------------------------

res <- knn_model$results
res <- select(res,c(kmax,distance,RMSE,MAE))
ggplot(res,aes(x=kmax)) + geom_line(aes(y=RMSE),lwd=2,color='#0078D7')

#save optimal hyperparameters
tuned <- knn_model$bestTune
#---------------Create out-of-fold (OOF) predictions----------------------------

knn_oof_preds <- rep(NA,nrow(knn_train0))

#iterate over folds
for (i in 1:KFOLD) {
    fold <- folds[[i]]
    
    #prepare fold specific data
    train0_x_fold <- knn_train0_x[-fold,]
    train0_y_fold <- knn_train0_y[-fold]
    val0_x_fold <- knn_train0_x[fold,]
    val0_y_fold <- knn_train0_y[fold]
    
    #call caret's train() to find the optimal kmax
    model <-  train(x=train0_x_fold,
                   y=train0_y_fold,
                   method='kknn',
                   metric='RMSE',
                   tuneGrid=tuned,
                   trControl = trainControl('none')
                   
    )
    
    #compute prediction on the remaining fold
    preds <- predict(model$finalModel,newdata=val0_x_fold)
    
    
    #add to the OOF predictions
    knn_oof_preds[fold] <- preds
}
#stop the cluster
stopCluster(cl)

knn_oof_error <- sqrt(mean((knn_oof_preds-knn_train0_y)^2))

#--------------create predictions on the ensemble validation set----------------
knn_ensemble_validation <- encode_type(ensemble_validation)
knn_ensemble_validation_y <- log(knn_ensemble_validation$price)
knn_ensemble_validation_x <- knn_ensemble_validation %>% select(knn_features)
knn_ens_preds <- predict(knn_model,newdata=knn_ensemble_validation_x)

Generalised Additive Model

Now an additive model is fitted using the gam package. This model is fitted last as it relies the most on an understanding of the data.

The model formula was constructed manually: the result of exploration first, then tuning each spline’s degrees of freedom. Caret was not used this time, as its interface hides too much of the fitting detail.

#---------------------------prepare data----------------------------------------
train0_gam <- polarise(MELBOURNE_CENTRE,train0)

gc <- gam.control(maxit=100)

form <- formula(
    log(price) ~
    type +                        #factor
    method +                      #factor
    factor(nbathroom) + factor(nrooms) +
    s(building_area_log,df=20) +
    s(dist_cbd,df=10) +
    s(bearing,df=28) +
    s(year_built,df=15) +
    s(land_area_log,df=10) +
    nsupermarket_2000 +
    ntrain_3000
)

#---------------------------Generate out of fold predictions--------------------

#initialise
gam_oof_preds <- rep(NA,nrow(train0))

for (i in 1:KFOLD) {
    fold <- folds[[i]]
    
    train0_gam_fold <- train0_gam[-fold,]
    val0_gam_fold <- train0_gam[fold,]
    
    gam_model <- gam(formula = form ,
                     data = train0_gam_fold ,
                     #control = gc,
                     family='gaussian'
    )
    
    #add out of fold predictions for this fold
    preds <- predict(gam_model,newdata=val0_gam_fold)
    gam_oof_preds[fold] <- preds
}

#compute error
gam_oof_error <- sqrt(mean((gam_oof_preds-log(train0_gam$price) )^2))

Plotted below are a few of the splines fitted by the model. Notice that, while the scales differ, these figures somewhat resemble the ICE plots produced by the model trained with xgboost.

terms_to_plot <- c("s(building_area_log, df = 20)",
                   "s(dist_cbd, df = 10)",
                   "s(bearing, df = 28)",
                   "s(year_built, df = 15)")
plot(gam_model,terms = terms_to_plot)

#----------retrain the full model to generate features for the meta model-------

gam_model <- gam(formula = form ,
                 data = train0_gam ,
                 #control = gc,
                 family='gaussian'
)

#--------------Generate metafeatures for the ensembling meta-model--------------

#These predictions will be used to validate the ensembling meta model
gam_ensemble_validation <- encode_type(ensemble_validation) %>% 
    polarise(MELBOURNE_CENTRE,.)
gam_ens_preds <- predict(gam_model,newdata=gam_ensemble_validation)

Ensembling

#---------------------------------------------data prep-------------------------

#setup ensembling train data by combining predictions on train0
#into a dataframe
ensemble_train_x <- cbind(xgb_oof_preds,knn_oof_preds,gam_oof_preds) %>%
    as.data.frame
ensemble_train_y <- log(train0$price)

#set up ensembling validation set by combining predictions on 
#ensemble_validation into a dataframe
ensemble_validation_x <- cbind(xgb_ens_preds,knn_ens_preds,gam_ens_preds) %>%
    as.data.frame
ensemble_validation_y <- log(ensemble_validation$price)

Let’s have a look at a summary of the primary models so far:

#-----------------------------Model Comparison----------------------------------

#accuracy with target
acc_df <- sapply(ensemble_train_x,function(x){
    sqrt(mean((x - log(train0$price) )^2))
}) %>% as.data.frame
names(acc_df) <- 'Accuracy (rmse)'
row.names(acc_df) <- c('xgboost','knn','gam')
acc_df %>% kable
          Accuracy (rmse)
xgboost         0.1620494
knn             0.2037288
gam             0.2032041
#correlations with target
cor_with_target <- sapply(ensemble_train_x,function(x){
    cor(x,log(train0$price)) %>% round(3)
}) %>% as.data.frame
names(cor_with_target) <- 'correlation with log(price)'
row.names(cor_with_target) <- c('xgboost','knn','gam')
cor_with_target %>% kable
          correlation with log(price)
xgboost   0.947
knn       0.916
gam       0.916

The large discrepancy in accuracy between the models is concerning, but stacking is still worth exploring.

The smaller the correlation between the models’ predictions, the better. Taking a look at the correlation matrix:

#correlations with each other
cor(ensemble_train_x) %>%  round(3) %>% as.data.frame %>% kable
                xgb_oof_preds  knn_oof_preds  gam_oof_preds
xgb_oof_preds           1.000          0.960          0.963
knn_oof_preds           0.960          1.000          0.942
gam_oof_preds           0.963          0.942          1.000

With correlations in the 0.90s, it seems possible we could squeeze out a little extra performance.

The ensembling will be performed via stacking with xgboost; the features lat and lng from the original data are included as well, for a slight performance increase.

#----------add features from the primary models to the meta features------------
feats <- c('lng','lat')

#get base features from the ensemble_validation set to combine with
#predictions from primary models
val_feats <- polarise(MELBOURNE_CENTRE,ensemble_validation) %>% encode_type %>%
    encode_method %>% select(feats)

#do the same for training
train_feats <- train0 %>% encode_type %>% encode_method %>% select(feats)

#combine these features to the ensembling data
ensemble_train_x <- ensemble_train_x %>% cbind(train_feats) 
ensemble_validation_x <- ensemble_validation_x %>% cbind(val_feats)

#---------------------------generate watchlist---------------------------------
#create a watchlist, xgboost will use the last element, (the validation set)
#when early stopping
watchlist<-list()
watchlist[['train']] <- xgb.DMatrix(data = as.matrix(ensemble_train_x),
                                   label =ensemble_train_y )
watchlist[['validation']] <- xgb.DMatrix(data = as.matrix(ensemble_validation_x),
                                        label = ensemble_validation_y)

#---------------------fit model-------------------------------------------------

#set params
PARAMS <- list(
    seed=0,
    objective='reg:linear',
    eta=0.001,
    eval_metric='rmse',
    max_depth=1,
    colsample_bytree=1,
    subsample=1
)

#fit the model
ens_mod <- xgb.train(
    data = watchlist$train,
    params =  PARAMS,
    nrounds =  1e+6,
    watchlist =  watchlist,
    verbose=FALSE,
    early_stopping_rounds = 1000
    )

#---------------------------------------------plots------------------------------
eval_log <- gather(ens_mod$evaluation_log,key='dataset_measure',
                   value='score',c(train_rmse,validation_rmse))
eval_log$dataset_measure <- factor(eval_log$dataset_measure)

ggplot(eval_log,aes(x=iter,y=score,color=dataset_measure)) + 
    coord_cartesian(ylim=c(0.1,0.4)) +
    geom_line(lwd=2)

The ensembling method produced an error of 0.164443 on the validation set.

xgb_ens_fi <- xgb.importance(model=ens_mod)
xgb.plot.importance(xgb_ens_fi,measure='Gain')

It seems that xgboost was the star of the show. The ensembled model achieved an error of 0.164443 on the validation set. Let’s compare this to the error of the primary models trained on all of train0; their predictions are conveniently stored in the data used to validate the ensembling meta-model.

#-------------------comparison with primary models
toprint <- sapply(ensemble_validation_x,function(x){
    sqrt(mean((x - log(ensemble_validation$price) )^2))
}) %>% as.data.frame
names(toprint) <- 'Accuracy (rmse)'

#don't evaluate the accuracy of lng and lat
subset(toprint,c(T,T,T,F,F)) %>% kable
                Accuracy (rmse)
xgb_ens_preds         0.1631929
knn_ens_preds         0.1979917
gam_ens_preds         0.2070017

It seems we’re better off simply using the model fitted with xgboost.

Discussion

A few comments on the analysis:

It was surprising to the author to see the OpenStreetMap data prove almost irrelevant in modelling property value. ntrain_3000 was included in the forest of gradient boosted decision trees fitted with xgboost, but its contribution to the validation score was minor.

A decision has been made to hold off on the test set for now, at least until other modelling options are explored. Other methods to consider include neural networks and support vector machines.

Plans exist to build a web application with shiny. This will be a partial dependence plot, aggregated by suburb, of the predictive model. The user will be able to add conditions to customise the information plotted based on their needs, such as the number of bathrooms, building area etc. This will give a clearer idea of the location premium on the value of a property of a particular type. This application will require an analysis of the accuracy beyond root mean square error to better understand the distribution of residuals and compute confidence intervals for predicted property sale price.

If you have any questions, comments or ideas, you can reach me at jordan.james.sands@gmail.com .